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Abstract 

We describe an implementation to solve Poisson's equation for an isolated system on a unigrid mesh using FFTs. The 
method solves the equation globally on mesh blocks distributed across multiple processes on a distributed-memory 
parallel computer. Test results to demonstrate the convergence and scaling properties of the implementation are 
presented. The solver is offered to interested users as the library PSPFFT. 

Program summary 

Program Title: PSPFFT 

Program obtainable from: |http://astro.phy s.utk.edu/_media/codes:pspfft- 1 .Q.tar.gzj CPC Program Library 
CPC Catalogue identifier: AEJK_vl_0 



Licensing provisions: Standard CPC license, |http ://cpc . c s . qub. ac.uk/licence/licence .html 
Programming language: Fortran 95 

Computer: any architecture with a Fortran 95 compiler, distributed memory clusters 
Operating system: Linux, Unix 

RAM: depends on the problem size, approximately 170 MBytes for 48^ cells per process 
Has the code been vectorized or parallelized?: Yes, using MPI 

Number of processors used: arbitrary number (subject to some constraints). It has been tested from 1 up to ~ 13000 
processors. 

Keywords: Poisson's equation, Poisson solver 
Classification: 4.3 Differential Equations 

External routines/libraries: MPI ( http ://www. mcs . anl.gov /mpi7|, FFTW (http://www.fftw.org). 
Silo (https://wci.llnl.gov/codes/silo/) (only necessary for running test problem) 

Nature of problem: Solving Poisson's equation globally on unigrid mesh distributed across multiple processes on 
distributed memory system. 

Solution method: Numerical solution using multidimensional discrete Fourier Transform in a parallel Fortran 95 code. 
Unusual features: This code can be compiled as library to be readily linked and used as a black-box Poisson solver 
with other codes. 

Running time: Depends on the size of the problem, but typically less than 1 second per solve 



1. Introduction 

Some physics simulations require the solution of Poisson's equation with an isolated source distribution and 
a vanishing boundary condition at infinity. A common example is the calculation of the Newtonian gravitational 
potential of a self-gravitating astrophysical system. Poisson's equation is 

V^O(x) = ^(x), (1) 

where S(x) describes the known distribution of the source that generates the potential 0(x). For instance, S(x) is 
proportional to the mass density in the context of Newtonian gravity, and to the charge density in electrostatics. 

Several methods exist to solve the discretized Poisson's equation on a uniform grid. These include, for example, 
multigrid methods, iterative / relaxation methods, several matrix methods, and methods that employ Fourier transforms 
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(for discussion of some of these, see for instance |[I][2l[3l). Here we implement, and extend to three dimensions, a 
particular method |4| (see also 1 1 1 and |5 |) of the latter class. (Another well-known approach for isolated systems 
based on Fourier transforms [6}, also discussed in yj, would not be as straightforward to parallelize and is not 
discussed here.) An advantage of this approach is that discrete Fourier transform algorithms have been well- studied, 
with the Fast Fourier transform (FFT) being the most commonly employed; it requires 0(nlogn) operations, where 
n is the number of elements to transform. Several FFT implementations, some freely available, also exist as libraries 
suitable for end-users. 

The key issue addressed by the implementation described here is the parallelization of an FFT-based algorithm for 
solving Poisson's equation for an isolated system. Obtaining such solutions in three dimensions requires resources 
that at present are most commonly available in distributed-memory parallel computers. Machines of this type allow 
large problems to be decomposed — for example, into multiple spatial subdomains — and distributed across different 
'processes' to be solved in parallel. (Each 'process' contains its own copy of the program; can only access memory 
locations allocated either statically or dynamically by the program; and can communicate with other processes only 
through a specific protocol, the Message Passing Interface (MPI) |T| presently being the most widely used.) While 
Poisson's equation must be solved globally on the computational domain and across multiple processes, most FFT 
implementations require that all data be accessible on a single process; therefore data movement and coordination 
across multiple processes are key ingredients of our FFT-based approach. 

Our Poisson solver for isolated systems in three dimensions, named PSPFFT ('Poisson Solver Parallel FFT'), is 
available as a library that can readily be used by and linked to other programs, subject to a few constraints on numbers 



of processes and mesh points described in Subsec. 2.2.1 We use the FFTW library | 8 |, but our use of it is abstracted 
in such a way that other FFT libraries could be used without having to make widespread changes throughout the code. 
We use MPI to manage data movement and communication across processes, but our calls to message passing routines 
are abstracted as well. 

This paper is organized as follows. Section |2] outlines the algorithm as well as implementation details specific 
to our code. This is followed in Section [3] by instructions on installation and use of the program as a library. Test 
problems illustrating the convergence properties and performance of our implementation are presented in Section]?] 
Section [5] contains concluding remarks. 

2. Solution Method 

2.7. Formulation 

Our problem is to solve Eq. ([T]) with the boundary condition 0(x) ^ as |x| ^ oo (vanishing boundary condition). 
Use of the Green's function 

^W = -j^, (2) 
4n |x| 

which satisfies 

V^G(x) = S(x) (3) 
and obeys the vanishing boundary condition, yields the formal solution 

0(x) = J dx' G(x - x')S(x'). (4) 

This integral may be evaluated with the help of the convolution theorem. Given the Fourier transforms G(k) and S (k) 
of G(x) and S (x), the Fourier transform of the potential is 

(I)(k) = G(k)5(k). (5) 

The potential 0(x) is then obtained by an inverse Fourier transform. 

When the Fourier transforms are to be done with FFTs, use of a regular mesh with the same spacings in each 
dimension is most natural; but in principle it should be possible to use any mesh for which a coordinate transformation 
can bring the mesh point positions to triplets of integers. For instance, to allow for a regular mesh with numbers of 
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Figure 1: A two-dimensional illustration (redrawn after | 4 |) of the mesh doubling used to solve for the potential due to an isolated source. The 
'active' mesh on the lower left is the original computational domain with Ux x Uy cells. 



mesh points n^, riy, and unequal (but constant) mesh point spacings hx, hy, in the three dimensions, Eqs. ([2])-(|4]) 
become 

hr hy hj 

G(x) = ' " =, (6) 

0(x) = J" dx! G(x - xO ^ (xO, (8) 

where the values of the transformed coordinates x corresponding to the mesh points are triplets of integers ranging 
from {) io Hx - I, Uy - \, - 1 in the three dimensions respectively. (Note that the Jacobian of the coordinate 
transformation has been absorbed into the numerator of Eq. ([6]), with the denominator still being An times the distance 
from the origin.) 

The implementation of boundary conditions at infinity on a necessarily finite computational domain can be handled 
'exactly', that is to say, with only discretization error, via a mesh doubling procedure and use of the standard periodic 
form of the discrete Fourier transform |4| (see also |1| and |5 |). Figure [T] illustrates this in two dimensions. The 
'active' portion of the mesh corresponds to the original computational domain, while the 'inactive' portions are those 
arising from doubling the extent of the mesh in each dimension. The source distribution is set to its known physical 
values in the active region, and to zero in the inactive regions. Indexing cells by integer triplets p, q, r (the position of 
the mesh points in transformed coordinates x), the Green's function in the active and inactive regions is 



< p,q,r < nx.ny.fiz 
p-\- q-\-r i^O 



Gp,q,r = -hx hy hz (4;:)"^ {hx^p^ + hy^q^ + /z/r^) ^'^ 

G2nx-p,q,r ~ Gp2ny-q,r ~ ^/?,^,2n2-r ~~ Gr2nx-p,2ny-q,2nz-r ~ Gp^q^r 
G2nx-p,2ny-q,r ~ Gr2nx-p,q,2nz-r ~ G p_^2ny-q,2nz-r ~ Gp^qj 

Go,o,o = -hx hy hz [47r mmQix, hy, /zj] ^ . (9) 

That is, the Green's function in the active region follows Eq. ([6]), and is set up in the inactive regions in such a way 
that a periodic replication of the doubled mesh in all dimensions yields Eq. ([6]) in the entire region -Ux, -riy, -n^ < 
p,q,r < Hx, riy, surrounding the origin. The value for Go,o,o regularizes the singularity at the origin by assigning it 
to be equal to the largest off'-origin value on the mesh; for hx = hy = = 1, this reduces (up to conventions for sign 
and the An factor) to the prescription given in 14J . 

2.2. Program Implementation 

PSPFFT is written in Fortran 95, using an object-oriented programming style to a degree convenient and possible 
within that language. The library FFTW 1 8 1 provides our basic EFT functionality, and we use MPI |i7J to implement 
parallelization across multiple processes. 



Figure 2: An illustrative brick decomposition in three dimensions for a computational domain assigned to twenty-seven processes. Only eleven 
bricks are shown to simplify the illustration. The bricks are labeled with the rank of the process that 'owns' them. (Process rank numbering here 
and in the following two figures begins with 1, rather than as in MPI and internally in the code.) 

2.2.7. Domain Decomposition and Storage 

In many physical simulations the problem size is large enough that the computational domain is spatially decom- 
posed into multiple subdomains, each assigned to a different process. In the general case the extent of the domain — 
and/or the number of mesh points — need not be the same in all dimensions. For the purpose of minimizing commu- 
nications, decompositions yielding subdomains with low surface-to-volume ratio are favorable in situations (such as 
hydrodynamics) in which nearest-neighbor information is required. Our code assumes that the source S (x) of Eq. 
([T]) is available, and that the potential 0(x) is desired, in a simple 'brick' decomposition: in three dimensions, the 
computational domain is divided in each dimension by ni^ = ^^fn^, the cube root of the number of processes np. For 
simplicity we require np = n^^ to be a perfect cube (in three dimensions), and the number of mesh points in each 
dimension to be evenly divisible by n^. Figure |2] illustrates the brick decomposition. 

The brick decomposition is not convenient for FFTs, however, because a single transform is most naturally and 
efficiently performed on data accessible to a single process; therefore our solver has its working storage arranged 
in 'pillars' rather than bricks. (A distributed parallel FFT that leaves data in place in the brick decomposition — and 
therefore requires parallelized one-dimensional FFTs — has also been mentioned in the literature |9|. In contrast, our 
approach allows in principle the use of any of several widely available and highly optimized serial FFT libraries.) 
The 'length' of what we term 'x pillars' spans the full extent of the computational domain in the x direction. The 
'width' of the x pillars is their extent in the y direction, which is l/nt times the y extent of the bricks. This implies 
another constraint imposed by our solver: the number of mesh points ny/n^ spanned by the y extent of a brick must 
itself be evenly divisible by n^. The 'height' of the x pillars, which is their extent in the z direction, is the same as the 
extent of the bricks in the z direction. By similar construction (and with similar constraints on n^ and nx), we have 
y pillars and and z pillars whose (width, height) are taken to be their extents in (z, x) and (x, y) respectively. These 
'pillar decompositions' cover the same total volume and contain the same total number of mesh points as the brick 
decomposition, as illustrated in Fig. [3] Finally, we note that the lengths of the pillars are doubled as necessary to 
accommodate the mesh doubling procedure, so that the pillars span both the active and inactive portions of the mesh. 

Because of the row-major nature of Fortran array storage, a pillar's length, width, and height correspond in our 
code to the first, second, and third dimensions of a rank-three array. This allows a (width x height) -number of one- 
dimensional FFTs to be performed efficiently on contiguous data, specifically on lines containing a number of data 
points equal to the pillar length. The construction of pillars from bricks and vice-versa requires data movement across 
diff'erent processes. Using MPI, this is accomplished by creating a subcommunicator for each group of processes that 
need to communicate data among themselves, as illustrated in Fig. [3] For each group, a call to MPI_AllToAll can then 
be made with the group's subcommunicator in order to achieve the construction of pillars. These subcommunicators 
are saved to be used in the reverse process of deconstructing pillars back into bricks. 
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Figure 3: An illustration of the transformation from brick decomposition to x pillars from a three-dimensional mesh assigned to twenty-seven 
processes. Here only the first (lowest in the z direction) xy slab of bricks is shown; other slabs independently follow the same transformation. 
The left panel shows the whole computational domain decomposed into bricks, demarcated by solid lines and assigned to processes labeled by the 
numbers in solid circles. Dashed lines in the left panel mark the chunks of data that need to be sent to the processes labeled by the numbers in 
dashed squares in order to build the pillars. As indicated by the dotted boundaries, processes [1,2,3], [4,5,6], and [7, 8,9] form three separate 
groups, each with its own subcommunicator within which chunks of data are exchanged during the construction of the x pillars. In the right panel, 
we see that each process (again, labeled by numbers in solid circles) also owns a pillar. The boundaries between pillars are now marked by solid 
lines, and the dashed lines indicate the chunks of data that the process received from other processes labeled by numbers in dashed squares. 

2.2.2. Multidimensional Transforms 

A multidimensional FFT can be accomplished as a sequence of sets of one-dimensional transforms. The number 
of required operations is still of 0(n log n), where n = n^nyn^ is the total number of mesh points. One possibility 
for achieving computational efficiency is to transpose data between transforms in subsequent dimensions in order to 
achieve contiguity of memory access in each dimension. In any case, such transpose operations become a necessity 
in a distributed memory environment if parallelization of individual one-dimensional transforms is to be avoided. 

The sequence of transforms and transposes is as follows. Data are initially loaded into the solver's x pillars: during 
initialization the Green's function is set up directly in the x pillars according to Eq. ([9]), while the source is transferred 
from the brick decomposition to the x pillars at the beginning of every solve. With data loaded in x pillars, multiple 
one-dimensional transforms in the x dimension are simultaneously performed by all processes. The y pillars are then 
populated, independently in separate xy 'slabs', as illustrated in Fig. |4] For each slab a separate MPI group with 
its own subcommunicator is created; thus are are n^ = ^^fn^ subcommunicators, each of which has n^^ processes. 
Within each subcommunicator a call to MPI_AllToAll transposes the data from x pillars to y pillars so that FFTs 
can be performed in the y direction. Similar transposes in yz slabs allow FFTs to be performed in z pillars. Here 
the multiplication of the transforms of the Green's function and the source takes place as well, with the resulting 
Fourier-space solution of the Poisson equation overwriting the transformed source distribution. A reverse sequence of 
backward transforms and transposes gets the solution (modulo a normalization factor due to the multiple transforms) 
back into real space, stored in the x pillars. Finally the solution is redistributed from the active portion of the x pillars 
to the brick decomposition, overwriting the storage in which the source was delivered. 

This sequence of transforms and transposes makes use of permanent storage for the source distribution in x, y, 
and z pillars, which at the end of the solve is reused to store the potential. This same storage is then updated with 
a new source distribution on the next call to the Poisson solver. The transform of the Green's function is computed 
only once, and stored permanently in z pillars, when the solver is initialized. Computation of the transformed Green's 
function requires x pillars and y pillars, but these are deallocated at the end of the initialization. 

3. Installation and Usage 

3.1. Installation 

PSPFFT is distributed as a gzip-compressed . tar file. Upon uncompression and extraction, the top-level directory 
PSPFFT has a README file and four subdirectories: Build, Conf ig. Source, and Test. Complete installation instruc- 
tions are given in the README file. During the build process, object files (with a . o extension) and Fortran module files 
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Figure 4: An illustration of the transpose of x pillars to _y pillars on a three-dimensional mesh assigned to twenty-seven processes. As before, only 
the first (lowest in the z direction) xy slab is shown. The solid rectangles demarcate data owned by different processes, labeled by numbers in solid 
circles. Dashed lines mark chunks of data that need to be sent to (left panel) and received from (right panel) processes labeled by numbers in dashed 
boxes. In this example, a slab with 9 processes forms a single MPI group with its own subcommunicator, and the transpose is accomplished with a 
call to MPI_AlltoAll. 



(usually with a * . mod extension) are placed in the Build directory. The Conf ig directory includes a configuration file 
in which system-specific values for the build and installation process are specified; these include the name of the MPI 
compiler wrapper, the location of the FFTW library, and the location of the installation. The directories Source and 
Test contain the PSPFFT source code and a test program respectively. The test program solves the problem described 
in Seeds 

Brief installation instructions are as follows. After replacing the system-specific values in the file Conf ig/Makef ile_Conf ig, 
the user should change to the Build directory to build and install the library by typing 

> make 

> make install 

This creates and copies the library file libpspf f t . a and the Fortran module files to the installation directory. The 
library file can then be linked to programs that need the solver. 

3.2. Usage 

This code snippet and the explanation that follows illustrate how to use the PSPFFT library in a Fortran program. 
1 program Main 



2 


use PSPFFT 


3 


implicit none 


4 
5 


include 'mpif.h' 


6 


integer : : & 


7 


Error 


8 


integer, dimension(3) :: & 


9 


nCellsPerProcess , & 


10 


nTotalCells 


11 


real(KR), dimension (3) :: & 


12 


CellWidth 


13 


type ( ArrayReal_3D_Base ) , dimension(l) :: & 


14 


SourceSolution 


15 


type (PSPFFT .Form ) , pointer :: & 


16 


PoissonSolver 


17 




18 


call MPI_INIT( Error) 


19 


! — Add lines to set nCellsPerProcess , nTotalCells , and CellWidth 


20 


call Initialize ( SourceSolution , nCellsPerProcess) 


21 


! — Add lines to fill in SourceSolution with the source distribution 


22 


call Create(PoissonSolver , CellWidth, nTotalCells, MPLCOMM_WORLD , & 
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23 VerbosityOption = CONSOLE_INFO_2) 

24 call Solve(PS, SourceSolution ) 

25 / — Add lines that use the potential returned in SourceSolution 

26 call Destroy(PS) 

27 call Finalize ( SourceSolution ) 

28 call MPI_FINALIZE( Error) 

29 end program Main 

All the derived data types, parameters, and subroutines used in the above example (except MPI-related variables and 
subroutines, such as MPI_COMM_WORLD and MPI_INIT) are defined by the Fortran module PSPFFT, which is used by 
the main program on line 2. After initializing MPI, the user should specify the number of cells per process, the total 
number of cells across all processes, and the cell width — all of which are arrays specifying these quantities in each of 
the three dimensions — as indicated on line 19. 

The variable SourceSolution is an array of derived data type ArrayReal_3D_Base. This is essentially a facility 
to make an array of rank-three arrays. ArrayReal_3D_Base has the following definition: 

type, public :: ArrayReal_3D_Base 

real(KR), dimension (:,:,:) , allocatable :: & 
Data 

end type ArrayReal_3D_Base 

The first, second, and third dimensions in SourceSolution7oData correspond to the x, y, and z dimensions on the 
mesh subdomain owned by a particular process. Its value specifies the source in that particular cell. Therefore this 
variable initially specifies the source distribution on the mesh which should be filled in by user. Line 20 simply 
initializes SourceSolution by allocating the necessary storage. 

The call to the Create () subroutine on line 22 allocates storage for the PoissonSolver variable and does all the 
necessary initializations. An optional argument VerbosityOption controls the verbosity of the messages the solver 
prints to STDOUT. Acceptable values (defined as parameters), in decreasing order of verbosity, are C0NS0LE_INF0_2, 
C0NS0LE_INF0_1, CONSOLE_WARNING, and CONSOLE_ERROR. If the argument is omitted, the verbosity level is set to 
CONSOLE_ERROR (the least verbose) by default. It is possible to replace the fourth argument (i.e. MPI_COMM_WORLD in 
this example) with a diff'erent MPI communicator which specifies a subgroup of processes that should be involved in 
solving the Poisson equation, subject to the constraints on numbers of processes and mesh points described in Section 

[223] 

Line 24 solves the Poisson equation with the source distribution passed as an argument. Upon return, the variable 
for the source is overwritten with the values of the potential. 

The allocatable storage is deallocated by the calls to Destroy () and Finalize in lines 26 - 27. 

All public subroutines exposed by PSPFFT are defined as generic interfaces using the function and subroutine 
overloading feature of Fortran 95. Therefore, other subroutines with the same names may be defined and will not 
conflict with the library as long as they are defined as generic interfaces. 

Assuming the above code example is in the file Main . f 90, compilation and linking to PSPFFT are accomplished 
by a command like the following: 

> mpifQO -L $ (INSTALL) /lib -1 pspfft -I $ (INSTALL) /include Main. f 90 

where $ (INSTALL) is the installation location of the library, and mpif 90 may also need to be replaced by a system- 
specific MPI compiler wrapper. 

4. Numerical Results 

In this section we present a few illustrative test problems, investigate the numerical convergence of our code with 
respect to mesh resolution, and explore its scaling behavior on a distributed-memory parallel computer. The chosen 
test problems are broadly similar to systems encountered in astrophysical simulations, except that they have analytic 
solutions; this allows us to verify the correctness of our implementation. 
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Figure 5: The potential of a unit sphere with uniform mass density p = I. The figure is a slice through the three-dimensional mesh crossing the 
origin to show the XF-plane. The solid black line indicates the surface of the sphere at radius R = 1. The mesh resolution is 256 cells in each 
dimension. 

4.1. Gravitational potential of spherical uniform mass 

Consider the gravitational potential of a sphere of radius R and uniform mass density p. The source is AnGp, and 
the potential as a function of radius r from the center of the sphere has a simple analytic solution: 



0(r) 



--7iGp(?>R^ -r^) f or r<R 

4 R^ 

— TiGp — for r > R. 
3 r 



(10) 



We compute the potential for a sphere of radius R = I and mass density p = 1 in a Cartesian computational box 
with inner and outer boundaries at [-1.2, +1.2] respectively in all dimensions. The sphere is centered on the origin of 
the coordinate system. For each mesh resolution, we calculate the potential in two ways: first, by using the analytic 
solution above with r = ^x^ + + z^, and second, by using our implementation of the Poisson solver. By varying 
the mesh resolution we can check the convergence properties of our solver with respect to spatial resolution. The 
potential for this test problem is shown in Figure [5] 

We use the usual definition of the Li norm to measure the error of the potential O computed by our solver relative 
to the analytic solution Oq: 

^ |o(x/,3;y,Zyt) - Oo (x/,3;y,zyt)| 

^'^) = V 1^ ( \\ • ^''^ 

Zj\'^o[xi,yj,Zk)\ 

ij,k 

The summation is over all cells, indexed by /, j, k,. The Li norm gives a single number as a quantitative measure 
of the error for a given mesh resolution; in contrast. Figure [6] illustrates the distribution of the relative error on the 
grid for a particular resolution, which is representative (by diff'erent constant factor) of the error distribution for other 
resolutions. 

Figure [7] shows the convergence of our solver (relative error as a function of resolution) for this problem. The 
convergence of the error trend is better than first order. 



8 



-0,0001741 
^-0,00013^6^ 
R- 8,705e-05 

^-0,000 _ 

Max; 0,00017! 
Min: 0,000 



M,353e-05 



Y o.oH 




Figure 6: The relative error, as described by equation [U] but without the summation over all cells, of the potential of a spherical uniform-density 
mass. The figure is a slice through the mesh showing the XF-plane. The mesh resolution is 256 cells in each dimension. The largest errors are on 
the surface of the sphere due to the nature of the Cartesian grid. 
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Figure 7: Convergence test for the potential of a spherical uniform-density mass. The Li-norm relative error of the computed potential as compared 
to the analytical solution is plotted as function of the following mesh resolutions: [48^, 144^, 288^, 384^, 576^, 768^, 1 152^]. The dot-dashed and 
dashed lines are references for first- and second-order error convergence respectively. 
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4.2. Gravitational potential of homogeneous spheroid 

A more general case of the previous test problem is the potential of a spheroid with uniform density p. The 
spheroid is formed by an ellipse centered at the origin and rotated about the z axis, and is described by the equation 

^4 = 1^ (12) 

a^ 

where a and b are the semi-diameters of the spheroid. The spheroid is oblate when a > b, with eccentricity e defined 
as 

The gravitational potential of a homogeneous spheroid 1101 is a simpler case of the potential of a homogeneous 
ellipsoid given in ITTl . Inside the spheroid, 

0(jc, y, z) = -TiGp [a {la^ -x^-y^) + B (b^ - z^)] , (14) 

where 



Vl -^2 _^ 



.2 

A = — — sin ^e r^, (15) 



5=4-^^si„-. (16) 

e^ 



Outside the spheroid the potential is given by 



0(x,j,z) = -InpG— 
e 



(17) 



with 



/z^— (18) 



in which /I is the positive root of the equation 

+ z^ 



a^ + A b^ -\- A 



= 1. (19) 



We compute the potential for a spheroid with eccentricity e = 0.9 and semi-major axis a = 0.5 on a Cartesian 
computational box of size two in each dimension. As before, we setp = 1. Figure [8] shows the computed potential for 
a particular mesh resolution. 

As in the previous test problem, we consider the error of the numerical solution relative to the analytic solution. 
Figure [9] illustrates the spatial distribution of the error for a particular resolution. The convergence of the error 
(specifically, the Li-norm) with higher resolution is shown in Fig. [TOj we see that on this problem our solver has 
higher than first order convergence, but less than second order. 

4.3. Gravitational potential of homogeneous binary spheroid 

In this problem we place two separate homogeneous spheroids in the computational domain. The extent of the 
domain in the x direction is twice that of the previous test problem, so that the x dimension has inner and outer 
boundaries at coordinates ±2. The spheroids are centered on coordinates [+1,0,0]. In order to maintain the same 
eff'ective resolution as our previous test problem, we double the number of cells in x dimension only, resulting in a 
rectangular computational box. 



Figure 1 1 shows the potential for this test problem, which is the sum of the potentials of the individual spheroids. 



Thus the analytic solution for this test problem is obtained by modifying the analytic solution found in Section [42] to 



account for the shift of the spheroids' centers from the origin. This is done by substituting x - c for x in Eqs. 14 17 



and 19 where c is the x coordinate of the center of spheroid. 



As before, we vary the mesh resolution for this test problem to do a convergence test of our solver. This is shown 



in Fig. 12 The convergence trend is similar to those of the previous test problems, namely, our solver converges better 



than first order, but does not reach second order convergence. 
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Figure 10: Convergence test of potential for a homogeneous spheroid with mesh resolutions [48^ 144^288^384^576^768^ 1152-^]. 



4.4. Scaling 

We test the weak scaling behavior of our solver by increasing the number of parallel processes while increasing 
the mesh resolution to maintain a constant amount of work per process. The total CPU time per solve can be expressed 



(20) 



where rip is the number of processes and /(rip) = constant for perfect weak scaling. In Fig. 13 we plot the wall time 
per solve 



- wall 



Tcpv/rip = fiup) 



(21) 



for the homogeneous spheroid and binary spheroid problems, averaged over 2500 solves (solid lines). The scaling 
tests were carried out on a Cray XT-4 with quad-core 2.1 GHz AMD Opteron 1354 (Budapest) processors and 8 GB 
of DDR2-800 memory per node. For compiling the code, we used the vendor-provided PGI compiler version 10 and 
FFTW version 3.2. Also shown are idealized theoretical trends in the absence of communication costs (dashed lines). 
For the FFT alone we expect 

Tcpu = a He log He, (22) 

where a is a constant and ric is the total number of cells. Rewriting in terms of Uc = ric/pHp, where ric/p is the number 
of cells per process, we have 

T^cpu = bnp logiric/pHp), (23) 
where Z? is a new constant. Thus the theoretical expectation we plot is 



T'wall = b logiUc/pHp), 

for n 



21 



1 . The number of cells per process is n, 



'dp 



(24) 



48-^ 



with the normalization b set by equating Eq. 24 to Eq. 

for the homogeneous spheroid and ridp = 96 x 48^ for the binary spheroid. We attribute the difference between the 
measured and theoretically ideal trends to communication costs that rise with the number of processes. We do not 
consider this extra cost severe, as the time per solve is still about 1 second with - 13, 000 processes. 
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Figure 11: The potential of a homogeneous spheroid binary. Each spheroid has mass density p = I, eccentricity e = 0.28, and semi-major axis 
a = 0.5. The solid black lines indicate the spheroids' surfaces. The mesh resolution is 768 x 384 x 384 cells. 
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Figure 12: Convergence test of potential for a binary spheroid with uniform mas with the same effective mesh resolutions as the previous test 
problems. 
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Figure 13: Wall time per solve r^aii plotted against number of processes rip, demonstrating the weak scaling behavior of PSPFFT. The homogeneous 
sphere and spheroid test problems assign 48 x 48 x 48 cells per process (black solid curve connected with squares), while the binary spheroid test 
problem assigns 96 x 48 x 48 cells per process due to the doubling of computational domain in one dimension (red solid curve connected with 
circles). The theoretically expected trend for the FFT alone — without communication costs — is shown by the dashed lines, whose vertical offsets 
are set to match the measured values forrir, = 1. 
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5. Conclusion 



We have described our implementation of a parallel solver for Poisson's equation for an isolated system. Our 
solution method uses Fourier transforms on a distributed unigrid mesh; in particular we use the FFTW library. We 
employ a conmion protocol, Message Passing Interface (MPI), for conmiunication between processes during a global 
solve on a distributed-memory system. We have shown test problems, numerical convergence, and the weak scaling 
behavior of our solver. We distribute the solver as a library, PSPFFT, which is suitable for use as part of a parallel 
simulation system. 
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